Purpose: Quantification, EDA and normalization for single-cell RNA-Seq data.
Rtsne for dim reduction, edgeR and limma for differential expression, ggplot2 for plotting, etc).This R/Bioconductor package includes:
SCESet (single-cell ExpressionSet) class in Bioconductorscater Overview
Citations:
Workflows:
To install the scater R/Bioconductor package
source("http://bioconductor.org/biocLite.R")
biocLite("scater")
library(scater)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: ggplot2
##
## Attaching package: 'scater'
## The following object is masked from 'package:stats':
##
## filter
data("sc_example_counts")
data("sc_example_cell_info")
scater assumes FASTQ files of individual cells have been assessed and removed based on quality control metrics based on bulk RNA-Seq metrics. For example, use fastqc, identify and remove libraries that are heavily degraded, libraries with a large amount of ribosomal, mitochondirla or other RNA type.
SCESet Data Class in scaterThis class holds the expression values. Class structure is derived from the ExpressionSet class in Bioconductor’s Biobase package (derived for microarray and bulk RNA-Seq analyses; allows assay data, gene/transcript metadata, and sample metadata). Adds additional slots to store:
To create a SCESet object, use newSCESet() and provided three data sets:
countData = numeric matrix of counts (rows = genes, columns = cells). In this case, the exprs slot in the object will be generated as log2(counts-per-million) using the edgeR::cpm() function, with a “prior count” value of 1.phenoData = An AnnotatedDataFrame describing phenotypic information for each cell (e.g. group, batch, etc)featureData = An AnnotatedDataFrame describing gene-level information for each gene (e.g. gc content, ensemble gene name, etc)You can also provide the normalized expression values without needing to provide the count data
exprsData = numeric matrix of expression values (rows = genes, columns = cells). The scater vignette states: “Typically these should be log2(counts-per-million) values or log2(reads-per-kilobase-per-million-mapped), appropriately normalised of course. The package will generally assume that these are the values to use for expression.” You can add the count data afterwards using the counts(object) function.newSCESet(): creates a new SCESet object. Requires the phenoData and featureData object to be an AnnotatedDataFrame.Example:
# define cell level data set (phenotypic data)
pdMat <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pdMat) <- pdMat$Cell
# define feature level data set (gene level)
gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fdMat <- new("AnnotatedDataFrame", data = gene_df)
scMat <- newSCESet(countData = sc_example_counts, phenoData = pdMat, featureData = fdMat)
scMat
## SCESet (storageMode: environment)
## assayData: 2000 features, 40 samples
## element names: counts, cpm, exprs, is_exprs
## protocolData: none
## phenoData
## sampleNames: Cell_001 Cell_002 ... Cell_040 (40 total)
## varLabels: Cell Mutation_Status Cell_Cycle Treatment
## varMetadata: labelDescription
## featureData
## featureNames: Gene_0001 Gene_0002 ... Gene_2000 (2000 total)
## fvarLabels: Gene
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
# to add a different expression matrix e.g. with avg count of 5 to avoid log(0)
exprs(scMat) <- edgeR::cpm.default(counts(scMat), prior.count = 5, log = TRUE)
# or can create a new SCESet object with only expression data
# Note: no count data, phenoData or featureData needed, though not recommended.
scMat2 <- newSCESet(exprsData = edgeR::cpm.default(scMat, prior.count = 5, log = TRUE))
Uses the SCESet object:
is_exprs(object): returns a logical matrix indicating whether each feature expression observation is above the defined lowerDetectionLimit (default is 0). This can be determined on the count scale or the “expression” (i.e. exprs(object)) scale.calcIsExprs(object, lowerDetectionLimit = 100): varies the detection limit thresholdcounts(object): returns feature countscalculateQCMetrics(object, feature_controls = 1:92, cell_controls = list(set_1 = 1:5, set_2 = 31:40)): calculates quality control metrics. Can define feature and cell controls. Can define more than one set of feature/cell controls using list()scater_gui(object): opens an interactive dashboard to play around with data using a graphical user interface (GUI)is_exprs(scMat) <- calcIsExprs(scMat, lowerDetectionLimit = 4,
exprs_data = "exprs")
head(is_exprs(scMat))
filter out genes that are not expressed with a lowerDectionLimit of 5
keep_genes <- rowSums(is_exprs(scMat)) > 5
scMat <- scMat[keep_genes, ]
The authors state: “Following QC, we can proceed with data normalisation before downstream analysis and modelling.”
Authors split the QC metrics/plots into three steps:
Many of the functions in the QC are plots to explore and identify problems (e.g. features or cells that need to be removed). The function calculateQCMetrics() is used to compute commonly used QC metrics. Controls (features or cells) can be defined.
scMat <- calculateQCMetrics(scMat, feature_controls = 1:92, cell_controls = 1:5)
varLabels(scMat) # lists all QC metrics
## [1] "Cell"
## [2] "Mutation_Status"
## [3] "Cell_Cycle"
## [4] "Treatment"
## [5] "total_counts"
## [6] "log10_total_counts"
## [7] "filter_on_total_counts"
## [8] "total_features"
## [9] "log10_total_features"
## [10] "filter_on_total_features"
## [11] "pct_dropout"
## [12] "exprs_feature_controls"
## [13] "pct_exprs_feature_controls"
## [14] "filter_on_pct_exprs_feature_controls"
## [15] "pct_exprs_top_50_features"
## [16] "pct_exprs_top_100_features"
## [17] "pct_exprs_top_200_features"
## [18] "counts_feature_controls"
## [19] "pct_counts_feature_controls"
## [20] "filter_on_pct_counts_feature_controls"
## [21] "pct_counts_top_50_features"
## [22] "pct_counts_top_100_features"
## [23] "pct_counts_top_200_features"
## [24] "n_detected_feature_controls"
## [25] "exprs_endogenous_features"
## [26] "counts_endogenous_features"
## [27] "log10_exprs_feature_controls"
## [28] "log10_counts_feature_controls"
## [29] "log10_exprs_endogenous_features"
## [30] "log10_counts_endogenous_features"
## [31] "is_cell_control"
The calculateQCMetrics() function adds the following columns to pData(object):
| Cell-level QC metric | Description |
|---|---|
total_counts |
total number of counts for the cell (aka ‘library size’) |
log10_total_counts |
total_counts on the log10-scale |
total_features |
the number of features for the cell that have expression above the detection limit (default detection limit is zero) |
filter_on_total_counts |
would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset? |
filter_on_total_features |
would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset? |
counts_feature_controls |
total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated. |
counts_endogenous_features |
total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated. |
log10_counts_feature_controls |
total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated. |
log10_counts_endogenous_features |
total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated. |
n_detected_feature_controls |
number of defined feature controls that have expression greater than the threshold defined in the object. |
pct_counts_feature_controls |
percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined. |
Other notes about filtering cells:
counts appear above, similar metrics can be computed for exprs, tpm and fpkmThe calculateQCMetrics() function adds the following columns to fData(object):
| Feature-level QC metric | Description |
|---|---|
mean_exprs |
the mean expression level of the gene/feature. |
exprs_rank |
the rank of the feature’s expression level in the cell. |
total_feature_counts |
the total number of counts mapped to that feature across all cells. |
log10_total_feature_counts |
total feature counts on the log10-scale. |
pct_total_counts |
the percentage of all counts that are accounted for by the counts mapping to the feature. |
is_feature_control |
is the feature a control feature? Default is FALSE unless control features are defined by the user. |
n_cells_exprs |
the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero). |
TBA
SCESet objectMany plotting functions are available for visualising the data. Many of the plots create can facet or group the gene expression by factors.
plot()Gives an overview of expression across cells. Plots cumulative proportion of each cell’s library that is accounted for by the highest-expressed features (default showing top 500 features); shows differences in expression distributions across cells (similar idea to boxplots)
plot(object, block1 = "groupFactor1", block2 = "groupFactor2", nfeatures = 500, exprs_values = "counts")exprs(object)). Other values can be specified with exprs_values argument (e.g. exprs, tpm and fpkm)block1 and block2 are optional arguments that can create a faceted set of plots separated by groupFactor1 phenotypic factor information in phenoData (e.g. batch). Allows user to see large differences in distributions of expression across experimental blocks, batches, etc.plot(scMat)
## Warning in plotSCESet(x, ...): The object does not contain tpm expression
## values. Using exprs(object) values instead.
plot(scMat, block1 = "Mutation_Status", block2 = "Treatment",
colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")
plotExpression()Plot expression levels for a defined set of features
plotExpression(object, rownames(scMat)[1:6], x = "groupFactor1", showMedian = FALSE, show_violin = TRUE)x argument splits each feature based on some phenotypic information (e.g. mutation status, batch)plotExpression(scMat, rownames(scMat)[1:6],
x = "Mutation_Status", exprs_values = "exprs")
plotExpression(scMat, rownames(scMat)[7:12],
x = "Mutation_Status", exprs_values = "counts", colour = "Cell_Cycle",
show_median = FALSE, show_violin = TRUE, xlab = "Mutation Status",
log = TRUE)
plotQC()Methods to produce various QC diagnostic plots.
plotQC(object, type = "highest-expression", exprs_values = "tpm")
type = "highest-expression": plot the most expressed features across the dataset. Default shows top 50 features.type = "exprs-freq-vs-mean": plot frequency of expression (number of cells with expression for a gene above the defined threshold) vs mean expression level. Gives an idea of technical noise in data set.multiplot function to plot multiple ggplot2 plots on the same page.pData(object))# filter features
keep_feature <- rowSums(is_exprs(scMat)) > 4
scMatSub <- scMat[keep_feature,]
## Plot QC
plotQC(scMatSub, type = "highest-expression")
plotQC(scMatSub, type = "exprs-freq-vs-mean")
plotFeatureData()Plot feature metadata and QC metrics. Can identify problematic cells by exploring relationship bewteen QC metrics computed from calculateQCMetrics(). Output is a ggplot2 object.
plotFeatureData(object, aes(x = n_cells_exprs, y = pct_total_counts))plotFeatureData(scMat, aes(x = n_cells_exprs, y = pct_total_counts))
Shows a small number of features that are ubiquitously expressed expressed in all cells (n_cells_exprs) and account for a large proportion of all counts observed (pct_total_counts; more than 0.5% of all counts).
plotPhenoData()Plot cell metadata and QC metrics. Output is a ggplot2 object.
plotPhenoData(object, aes(x = total_counts, y = total_features, colour = groupFactor1))plotPhenoData(scMat, aes(x = total_counts, y = total_features,
colour = Mutation_Status))
plotPhenoData(scMat, aes(x = Mutation_Status, y = total_features,
colour = log10_total_counts))
plotPCA()Produce a PCA plot for visualizing the cells. PCs calculated using 500 features with most variable expression across all cells (can be changed with ntop argument) from the exprs slot (but can be changed e.g. exprs_values="cpm").
plotPCA(object, ntop=500, exprs_values="exprs"): p
feature_set = fData(object)$is_feature_control: if control features are defined, you can use only these to calculate PCs.ncomponents = 4, colour_by = "Treatment", shape_by = "Mutation_Status": allows more than 2 PCs to be plotted and allows phenotypic information to be defined using colour, shapes, size etc.object <- plotPCA(object): the SCESet object has a reducedDimension slot which stores the top PCs. This can be accessed using the reducedDimension() function (e.g. reducedDimension(object)).plotPCA(scMat)
Other plot functions.
plotTSNE(object, colour_by = "Gene_0001", size_by = "Gene_1000"): produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cellsplotDiffusionMap(object, colour_by = "Gene_0001", size_by = "Gene_1000"): produce a diffusion map (reduced dimension) plot for the cellsplotMDS(object): produce a multi-dimensional scaling plot for the cellsplotReducedDim(object): plot a reduced-dimension representation of the cellsExample:
plot(scMat)
plot_genes_jitter(sc[1:2,], grouping = "groupFactor", ncol = 2)
plot_spanning_tree(sc) # plots the order of the cells using PC1 and PC2.
plot_genes_in_pseudotime(sc_subset, color_by = "Time")
plot_genes_jitter(object, grouping = "groupFactor", ncol = 2) = Create plots of gene expression grouped by factors (only for a small number of genes). Based on the CellDataSet object.plot_spanning_tree(object) = Plots the minimum spanning tree on cells after applying the orderCells() function.plot_genes_in_pseudotime() = Plots expression for one or more genes as a function of pseudotime.